Kristian K Ullrich
Max Planck Institute for Evolutionary Biology
A matter of distance.
Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.
Typically used to:
An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.
MSA2dist Features:
From Bioconductor:
From GitHub:
Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.
Source: Pico-PLAZA 3.0
Where to find: the data/ directory in the GitHub repo associated with this presentation.
data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
codingsequences.rda: DNAStringSet object.clusters.rda: Pre-calculated syntenet clusters.codingsequences# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(tidyr)
library(dplyr)
library(stringr)
# Load data set codingsequences
load(here::here("data", "codingsequences.rda"))
# Inspect data
head(names(codingsequences))[1] "AP00G00010" "AP00G00020" "AP00G00030" "AP00G00040" "AP00G00050"
[6] "AP00G00060"
clusters Gene Cluster
1 Chlo_CNC64A_028G00030 1
2 Chlo_CNC64A_028G00040 2
3 Chlo_CNC64A_028G00070 3
4 Chlo_CNC64A_028G00080 4
5 Chlo_CNC64A_028G00110 5
6 Chlo_CNC64A_028G00140 6
[1] 22605
1 2 3 4 5 6
38 38 37 2 9 4
# Get first cluster of size 10 (fc10)
fc10 <- which(table(clusters$Cluster)==10)[1]
my_cluster <- clusters %>%
dplyr::filter(Cluster==fc10)
head(my_cluster) Gene Cluster
1 Oluc_OL13G02120 119
2 Bpra_BPRRCC1105_04G02530 119
3 Osp_ORCC809_13G00870 119
4 Oluc_OL21G00860 119
5 Omed_OM_08G02190 119
6 Msp_MRCC299_06G03520 119
# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]
head(my_cluster) Gene Cluster GeneID
1 Oluc_OL13G02120 119 OL13G02120
2 Bpra_BPRRCC1105_04G02530 119 BPRRCC1105_04G02530
3 Osp_ORCC809_13G00870 119 ORCC809_13G00870
4 Oluc_OL21G00860 119 OL21G00860
5 Omed_OM_08G02190 119 OM_08G02190
6 Msp_MRCC299_06G03520 119 MRCC299_06G03520
# Fetch coding sequences
my_cds <- codingsequences[match(my_cluster$GeneID,
names(codingsequences))]
head(my_cds)DNAStringSet object of length 6:
width seq names
[1] 1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL13G02120
[2] 1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530
[3] 1605 ATGTCGCTCAAATCGATCAACCC...GCATCAACATGCGAAAGAAATAG ORCC809_13G00870
[4] 1605 ATGTCGCTCAAGTCCATCAACCC...GCGTGAACATGCGCAAGCGATGA OL21G00860
[5] 1605 ATGAGTTTGAAATCCGTCAACCC...GGGTGAACATGCGCAAACGATAA OM_08G02190
[6] 1707 ATGCAGGGCATGCAAAGCCCCAT...CTCCCGAAAAACTGGGGGAGTAA MRCC299_06G03520
To calculate a coding sequence alignment for two sequences:
# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_alnDNAStringSet object of length 2:
width seq names
[1] 1623 ATGTCG------CTCAAGTCCAT...GCAAGCGA------------TGA OL13G02120
[2] 1623 ATGTCCTCCTCTCTCCGTTCGGT...GCGGTCGCGGACCGGCTGAATAA BPRRCC1105_04G02530
To calculate dN/dS from this alignment:
# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks Comp1 Comp2 seq1 seq2 ka ks vka
1 1 2 OL13G02120 BPRRCC1105_04G02530 0.1640926 1.849032 0.0003232055
vks
1 0.4047548
Other Biostrings::pairwiseAlignment options can be set:
type = "global", substitutionMatrix = "BLOSUM62"
gapOpening = 10, gapExtension = 0.5
To calculate all pairwise combinations:
start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()
head(my_cds_kaks) Comp1 Comp2 seq1 seq2 Method Ka Ks
result.1 1 2 OL13G02120 BPRRCC1105_04G02530 YN 0.145529 4.33579
result.2 1 3 OL13G02120 ORCC809_13G00870 YN 0.0328853 2.73769
result.3 1 4 OL13G02120 OL21G00860 YN NA NA
result.4 1 5 OL13G02120 OM_08G02190 YN 0.0570134 4.34114
result.5 1 6 OL13G02120 MRCC299_06G03520 YN 0.672847 4.12155
result.6 1 7 OL13G02120 MP09G03570 YN 0.665727 4.08101
Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4)
result.1 0.0335645 3.77014e-136 1602 324.101 1277.9 NA
result.2 0.0120121 3.37981e-125 1602 313.774 1288.23 NA
result.3 NA NA 1602 313.942 1288.06 NA
result.4 0.0131333 1.79968e-161 1602 326.422 1275.58 NA
result.5 0.163251 9.8868e-64 1560 243.569 1316.43 NA
result.6 0.163128 5.33907e-66 1560 230.753 1329.25 NA
Substitutions S-Substitutions N-Substitutions
result.1 443 274.184 168.816
result.2 238 196.59 41.4101
result.3 0 NA NA
result.4 326 255.985 70.0148
result.5 819 236.086 582.914
result.6 811 226.544 584.456
Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1 NA NA
result.2 NA NA
result.3 NA NA
result.4 NA NA
result.5 NA NA
result.6 NA NA
Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1 0.993263 1.65453:1.65453:1:1:1:1
result.2 0.562657 2.66376:2.66376:1:1:1:1
result.3 NA 2:2:1:1:1:1
result.4 0.929943 1.39012:1.39012:1:1:1:1
result.5 1.21131 1.36501:1.36501:1:1:1:1
result.6 1.17091 1.09442:1.09442:1:1:1:1
GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.537893(0.536969:0.358595:0.718115) NA NA NA NA
result.2 0.561682(0.562617:0.349533:0.772897) NA NA NA NA
result.3 0.561371(0.568224:0.35514:0.760748) NA NA NA NA
result.4 0.542368(0.561682:0.35514:0.71028) NA NA NA NA
result.5 0.568325(0.560034:0.359348:0.785592) NA NA NA NA
result.6 0.568442(0.544674:0.359107:0.801546) NA NA NA NA
How long did it take?
ka_mat <- my_cds_kaks %>%
dplyr::select(seq1, seq2, Ka) %>%
dplyr::mutate(Ka = as.numeric(Ka)) %>%
tidyr::pivot_wider(names_from = "seq1", values_from = Ka)
head(ka_mat)# A tibble: 6 × 10
seq2 OL13G…¹ BPRRC…² ORCC8…³ OL21G…⁴ OM_08…⁵ MRCC2…⁶ MP09G…⁷ MP09G…⁸ OT_13…⁹
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BPRRC… 0.146 NA NA NA NA NA NA NA NA
2 ORCC8… 0.0329 0.151 NA NA NA NA NA NA NA
3 OL21G… NA 0.146 0.0329 NA NA NA NA NA NA
4 OM_08… 0.0570 0.146 0.0719 0.0570 NA NA NA NA NA
5 MRCC2… 0.673 0.699 0.653 0.673 0.715 NA NA NA NA
6 MP09G… 0.666 0.696 0.646 0.666 0.678 0.0486 NA NA NA
# … with abbreviated variable names ¹OL13G02120, ²BPRRCC1105_04G02530,
# ³ORCC809_13G00870, ⁴OL21G00860, ⁵OM_08G02190, ⁶MRCC299_06G03520,
# ⁷MP09G03570, ⁸MP09G02740, ⁹OT_13G02270
# Extract sequence names
seq_names <- c(colnames(ka_mat)[-1], rev(ka_mat$seq2)[1])
# Add NA to fill into square distance matrix
ka_mat <- NA |> rbind(ka_mat[, -1]) |> cbind(NA) |>
tidyr::as_tibble()
# Set column names
colnames(ka_mat) <- seq_names
head(ka_mat)# A tibble: 6 × 10
OL13G02120 BPRRCC110…¹ ORCC8…² OL21G…³ OM_08…⁴ MRCC2…⁵ MP09G…⁶ MP09G…⁷ OT_13…⁸
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA NA NA NA NA NA NA NA NA
2 0.146 NA NA NA NA NA NA NA NA
3 0.0329 0.151 NA NA NA NA NA NA NA
4 NA 0.146 0.0329 NA NA NA NA NA NA
5 0.0570 0.146 0.0719 0.0570 NA NA NA NA NA
6 0.673 0.699 0.653 0.673 0.715 NA NA NA NA
# … with 1 more variable: MRCC299_06G02640 <lgl>, and abbreviated variable
# names ¹BPRRCC1105_04G02530, ²ORCC809_13G00870, ³OL21G00860, ⁴OM_08G02190,
# ⁵MRCC299_06G03520, ⁶MP09G03570, ⁷MP09G02740, ⁸OT_13G02270
ape::bionjs reconstructs a phylogenetic tree from a distance matrix with possibly missing values:
Here, a pre-calculated MSA is necessary:
Plot average behavior of each codon:
Kristian K Ullrich @kullrich